Added synthetic health data tool #35
base: main
Conversation
Pull Request Overview
This PR adds a comprehensive synthetic health data tool for generating realistic health data that mimics the structure of the LEGO dataset. The tool creates synthetic health data with spatial, temporal, and population-based variations using configurable Poisson parameters.
- Implements vectorized synthetic data generation with geographic and demographic effects (a rough sketch of this approach follows this list)
- Adds configuration support for synthetic health data parameters and file paths
- Provides caching mechanisms to optimize performance for large datasets
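The review summary does not include the generation code itself. Purely as an illustration of the vectorized approach described in the first bullet, a single `np.random.poisson` call can draw all ZCTA counts for one date; the `base_rate` parameter and `spatial_effect` column below are assumptions for the sketch, while `seasonal_amplitude` is a parameter name that appears in the review comments further down.

```python
import numpy as np
import pandas as pd

def sample_daily_counts(zcta_data: pd.DataFrame, day_of_year: int, params: dict) -> np.ndarray:
    """Draw one synthetic count per ZCTA for a single day (illustrative sketch only)."""
    # Seasonal modulation shared by every ZCTA on this date
    seasonal = params["seasonal_amplitude"] * np.sin(2 * np.pi * day_of_year / 365.25)
    # Per-ZCTA Poisson rate combining a base rate, a precomputed spatial effect, and the seasonal term
    lam = params["base_rate"] * (1.0 + zcta_data["spatial_effect"].to_numpy() + seasonal)
    lam = np.clip(lam, 0.0, None)   # Poisson rates must be non-negative
    return np.random.poisson(lam)   # vectorized draw, one count per ZCTA
```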
Reviewed Changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 5 comments.
File | Description
---|---
`src/preprocessing_synth_health.py` | Main synthetic health data preprocessing tool with ZCTA data loading, vectorized data generation, and caching
`conf/synthetic/config.yaml` | Configuration file defining synthetic data parameters, paths, and Poisson distribution settings
`conf/datapaths/datapaths_cannon.yaml` | Updated data paths configuration to include synthetic health output directory
src/preprocessing_synth_health.py (outdated)
'longitude': np.random.uniform(-125, -65, len(df_unique)),  # Approximate US bounds
'latitude': np.random.uniform(25, 50, len(df_unique))
Copilot AI · Sep 23, 2025
The hardcoded longitude and latitude bounds (-125, -65, 25, 50) are magic numbers. Consider defining these as named constants at the module level, e.g., `US_LONGITUDE_MIN = -125`, `US_LONGITUDE_MAX = -65`, etc.
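A minimal sketch of the suggested refactor, using the reviewer's example constant names; the helper function and its name are illustrative, not part of the PR.

```python
import numpy as np

# Approximate bounding box of the contiguous United States (values from the PR)
US_LONGITUDE_MIN, US_LONGITUDE_MAX = -125.0, -65.0
US_LATITUDE_MIN, US_LATITUDE_MAX = 25.0, 50.0

def random_us_coords(n: int) -> dict:
    """Draw n uniformly distributed points inside the approximate US bounding box."""
    return {
        "longitude": np.random.uniform(US_LONGITUDE_MIN, US_LONGITUDE_MAX, n),
        "latitude": np.random.uniform(US_LATITUDE_MIN, US_LATITUDE_MAX, n),
    }
```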
src/preprocessing_synth_health.py (outdated)
# Pre-calculate all spatial effects (these don't change by date)
zcta_data = zcta_data.copy()
zcta_hashes = [hash(str(zcta)) % 1000 / 100.0 for zcta in zcta_data['zcta']]
Copilot AI · Sep 23, 2025
The expression `% 1000 / 100.0` uses magic numbers. Consider defining these as named constants like `HASH_MOD = 1000` and `HASH_DIVISOR = 100.0` to make the normalization logic clearer.
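One way the suggestion could look, with the reviewer's proposed names; the wrapper function is illustrative only.

```python
# Constant names follow the reviewer's suggestion; the original code inlines the numbers
HASH_MOD = 1000       # keep only the last three digits of the hash
HASH_DIVISOR = 100.0  # rescale the 0-999 remainder to roughly the 0-10 range

def zcta_hash_effect(zcta) -> float:
    """Map a ZCTA identifier to a small per-run spatial offset via Python's hash()."""
    return hash(str(zcta)) % HASH_MOD / HASH_DIVISOR
```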
lat_normalized = (zcta_data['latitude'] - 35) / 15
lon_normalized = (zcta_data['longitude'] + 95) / 30
Copilot AI · Sep 23, 2025
The geographic normalization uses magic numbers (35, 15, 95, 30). These appear to be US geographic center and scaling factors. Consider defining these as named constants like `US_LAT_CENTER = 35`, `LAT_SCALE_FACTOR = 15`, etc.
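A possible shape for that refactor; the constant names mirror the reviewer's examples and the helper function is illustrative, not code from the PR.

```python
import pandas as pd

# Approximate US geographic center and spans (names follow the reviewer's examples)
US_LAT_CENTER, US_LAT_RANGE = 35.0, 15.0    # (lat - 35) / 15 maps 20-50 N onto [-1, 1]
US_LON_CENTER, US_LON_RANGE = -95.0, 30.0   # (lon + 95) / 30 maps 125-65 W onto [-1, 1]

def normalize_coords(zcta_data: pd.DataFrame) -> pd.DataFrame:
    """Return latitude/longitude columns normalized around the rough US midpoint."""
    return pd.DataFrame({
        "lat_normalized": (zcta_data["latitude"] - US_LAT_CENTER) / US_LAT_RANGE,
        "lon_normalized": (zcta_data["longitude"] - US_LON_CENTER) / US_LON_RANGE,
    })
```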
src/preprocessing_synth_health.py (outdated)
LOGGER.warning("No population data found, using synthetic population")
df_pop = pd.DataFrame({
    'zcta': df_unique['zcta'],
    'population': np.random.lognormal(mean=8.5, sigma=1.2, size=len(df_unique))  # Realistic population dist
Copilot AI · Sep 23, 2025
The lognormal distribution parameters (mean=8.5, sigma=1.2) are magic numbers. Consider defining these as named constants like `SYNTHETIC_POP_MEAN = 8.5` and `SYNTHETIC_POP_SIGMA = 1.2`, or making them configurable.
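A hedged sketch of the named-constant option (moving the values into the config file would serve the same purpose; the helper function name is illustrative):

```python
import numpy as np
import pandas as pd

# Fallback population distribution parameters (constant names are the reviewer's examples)
SYNTHETIC_POP_MEAN = 8.5   # mean of log(population); exp(8.5) is roughly 4,900 people
SYNTHETIC_POP_SIGMA = 1.2  # spread of log(population)

def synthetic_population(zctas: pd.Series) -> pd.DataFrame:
    """Generate a plausible synthetic population per ZCTA when no real data is available."""
    return pd.DataFrame({
        "zcta": zctas.to_numpy(),
        "population": np.random.lognormal(SYNTHETIC_POP_MEAN, SYNTHETIC_POP_SIGMA, size=len(zctas)),
    })
```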
for target_date in date_list:
    # Calculate seasonal effect for this date
    day_of_year = target_date.timetuple().tm_yday
    seasonal_effect = poisson_params['seasonal_amplitude'] * np.sin(2 * np.pi * day_of_year / 365.25)
Copilot AI · Sep 23, 2025
The magic number 365.25 (days per year including leap years) should be defined as a named constant like `DAYS_PER_YEAR = 365.25` for better maintainability.
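For example, the constant and the seasonal calculation might be factored out like this (the helper name is illustrative):

```python
import numpy as np

DAYS_PER_YEAR = 365.25  # average calendar year length, accounting for leap years

def seasonal_effect(day_of_year: int, amplitude: float) -> float:
    """Sinusoidal seasonal modulation, peaking about a quarter of the way into the year."""
    return amplitude * np.sin(2 * np.pi * day_of_year / DAYS_PER_YEAR)
```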
- Remove unnecessary caching logic (pickle, @lru_cache)
- Simplify ZCTA hash calculation using direct numeric conversion (see the sketch after this list)
- Adjust Poisson parameters for realistic sparsity (~79% zeros)
- Maintain mainland US filtering (32,657 ZCTAs)
- Generate proper synthetic health data with horizons [0, 30, 90, 180]
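The commit message does not show the new hash calculation. One plausible reading of "direct numeric conversion", offered purely as an assumption, is to parse the ZCTA code as a number instead of hashing its string form:

```python
import pandas as pd

def zcta_numeric_effect(zcta_data: pd.DataFrame) -> pd.Series:
    """Derive a per-ZCTA effect directly from the numeric ZCTA code
    (a guess at what "direct numeric conversion" means; not copied from the PR)."""
    zcta_num = pd.to_numeric(zcta_data["zcta"], errors="coerce").fillna(0)
    return (zcta_num % 1000) / 100.0  # same 0-10 scale as the earlier hash-based effect
```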
Pull Request Overview
Copilot reviewed 4 out of 5 changed files in this pull request and generated 5 comments.
src/preprocessing_synth_health.py (outdated)
'longitude': np.random.uniform(-125, -65, len(df_unique)),  # Approximate US bounds
'latitude': np.random.uniform(25, 50, len(df_unique))
Copilot AI · Sep 23, 2025
The hardcoded geographic bounds (-125, -65, 25, 50) should be defined as named constants at the module level to improve maintainability and make the values more explicit.
src/preprocessing_synth_health.py (outdated)
LOGGER.warning("No population data found, using synthetic population")
df_pop = pd.DataFrame({
    'zcta': df_unique['zcta'],
    'population': np.random.lognormal(mean=8.5, sigma=1.2, size=len(df_unique))  # Realistic population dist
Copilot AI · Sep 23, 2025
The lognormal distribution parameters (mean=8.5, sigma=1.2) should be defined as named constants or moved to the configuration file to make them configurable and improve maintainability.
lat_normalized = (zcta_data['latitude'] - 35) / 15
lon_normalized = (zcta_data['longitude'] + 95) / 30
Copilot AI · Sep 23, 2025
The normalization constants (35, 15, -95, 30) should be defined as named constants to clarify their purpose as geographic center points and ranges for the US.
Suggested change:
-    lat_normalized = (zcta_data['latitude'] - 35) / 15
-    lon_normalized = (zcta_data['longitude'] + 95) / 30
+    lat_normalized = (zcta_data['latitude'] - US_LAT_CENTER) / US_LAT_RANGE
+    lon_normalized = (zcta_data['longitude'] - US_LON_CENTER) / US_LON_RANGE
for target_date in date_list:
    # Calculate seasonal effect for this date
    day_of_year = target_date.timetuple().tm_yday
    seasonal_effect = poisson_params['seasonal_amplitude'] * np.sin(2 * np.pi * day_of_year / 365.25)
Copilot AI · Sep 23, 2025
The value 365.25 (days per year accounting for leap years) should be defined as a named constant to make its purpose explicit.
LOGGER.info(f"Found {len(zcta_data)} ZCTAs for year {year} with complete data")

# get days list for a given year with calendar days
days_list = [(year, month, day) for month in range(1, 13) for day in range(1, calendar.monthrange(year, month)[1] + 1)]
Copilot AI · Sep 23, 2025
This complex list comprehension for generating all days in a year should be extracted into a separate helper function with a descriptive name like `generate_year_days()` to improve readability.
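A sketch of the suggested helper, reusing the comprehension from the snippet above:

```python
import calendar

def generate_year_days(year: int) -> list:
    """Return (year, month, day) tuples for every calendar day of the given year."""
    return [
        (year, month, day)
        for month in range(1, 13)
        for day in range(1, calendar.monthrange(year, month)[1] + 1)
    ]
```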
This pull request introduces a synthetic health data pipeline, including configuration, workflow, and preprocessing scripts to generate synthetic health datasets with realistic spatial, temporal, and population effects. The main changes are the addition of a configuration file for synthetic data generation, a Snakemake workflow to orchestrate the process, and a comprehensive preprocessing script that creates synthetic health data using vectorized operations and saves the results in the expected format.
Synthetic Health Data Pipeline Implementation
- Added `conf/synthetic/config.yaml`, specifying parameters for synthetic data generation, including Poisson distribution parameters, date ranges, data paths, and debug options.
- Added `snakefile_synthetic_health.smk` to automate the preprocessing of synthetic health data for each variable and year, producing daily output files.

Preprocessing and Data Generation
- Added `src/preprocessing_synth_health.py`, a script that loads ZCTA geographic and population data, generates synthetic health counts using vectorized Poisson sampling with spatial and seasonal effects, and writes daily horizon files in parquet format (a sketch of one possible output step follows below).

Configuration and Data Path Updates
- Updated `conf/datapaths/datapaths_cannon.yaml` to add new output directories for synthetic health and covariate data.
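As a rough illustration of the last preprocessing step, writing one parquet file per day might look like the following; the directory layout, file naming, and function signature are assumptions for the sketch, not taken from the PR.

```python
from pathlib import Path
import pandas as pd

HORIZONS = [0, 30, 90, 180]  # forecast horizons listed in the commit message

def write_daily_file(df_day: pd.DataFrame, out_dir: str, variable: str, date_str: str) -> Path:
    """Write one day's synthetic counts to parquet; layout and naming are illustrative."""
    target_dir = Path(out_dir) / variable
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{date_str}.parquet"
    df_day.to_parquet(target, index=False)
    return target
```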